Fix bus error or segfault from roi_align with large batchsize#9441

Open
zy1git wants to merge 12 commits into pytorch:main from
zy1git:issue-8206

Conversation


@zy1git zy1git commented Mar 13, 2026

Summary
Bug: roi_align in torchvision crashes with a bus error/segfault on CPU, or silently returns wrong (all-zero) results on CUDA, when the total number of output elements exceeds INT_MAX (~2.1 billion). The cause is 32-bit int overflow in index arithmetic within the C++ and CUDA kernels.

Root Cause: The kernels use int for composite index calculations like n × channels × pooled_width × pooled_height and pointer offsets like (roi_batch_ind × channels + c) × height × width. When these products exceed 2,147,483,647, the int wraps to a negative value, causing out-of-bounds memory access.

Example: FasterRCNN with batch_size=172 generates ~172,000 ROIs. The output index reaches 171,999 × 256 × 7 × 7 = 2,157,555,456 > INT_MAX, which matches the reporter's observed threshold exactly.

Fix: Promoted int to int64_t for all index, offset, and stride variables in the relevant files.
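The wrap-around can be sketched in a few lines of Python (a standalone illustration of the arithmetic, not torchvision code; the figures mirror the reproducer described above):

```python
INT_MAX = 2**31 - 1

def as_int32(x):
    """Interpret a Python int as a wrapped 32-bit signed int (C `int` overflow behavior)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

# Index of the last ROI's first output element, using the reproducer's shapes
n_rois, channels, pooled_h, pooled_w = 172_000, 256, 7, 7
true_index = (n_rois - 1) * channels * pooled_h * pooled_w

print(true_index > INT_MAX)   # True: the product exceeds 32-bit int range
print(as_int32(true_index))   # negative: what a 32-bit `int` would actually hold
```

A negative index fed into pointer arithmetic is exactly the out-of-bounds access that produces the bus error/segfault on CPU.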

Test Plan
New overflow non-regression test
pytest test/test_ops.py::TestRoIAlign::test_roi_align_large_index -v

Existing tests — verify no regressions
pytest test/test_ops.py::TestRoIAlign -v

Fixes #8206


pytorch-bot bot commented Mar 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9441

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f40a46d with merge base d7400a3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Mar 13, 2026
@zy1git zy1git marked this pull request as draft March 13, 2026 10:03
test/test_ops.py Outdated
output_bytes = n_rois * channels * pooled_h * pooled_w * 4 # float32
if output_bytes > 9 * 1024**3:
pytest.skip("Test requires ~9 GB of memory")

Member

all these values are statically defined. This if block is either always True or always False.

Contributor Author

Thanks for pointing this out. I agree. I removed that part in the new commit.

test/test_ops.py Outdated
x = torch.rand(num_imgs, channels, height, width, dtype=torch.float32, device=device)
rois = torch.zeros(n_rois, 5, dtype=torch.float32, device=device)
except RuntimeError:
pytest.skip("Not enough memory to allocate test tensors")
Member

@NicolasHug NicolasHug Mar 16, 2026

Please verify the tests aren't being skipped on the CI. If they pass, remove the try/except; if they don't, we'll have to consider other strategies to test this.

Contributor Author

Thanks for the comment. I verified that the tests are not being skipped on the CI and my devserver. I removed the try/except in the new commit.

template <typename T>
void roi_align_backward_kernel_impl(
int nthreads,
int64_t nthreads,
Member

@NicolasHug NicolasHug Mar 16, 2026

Can you explain why nthreads needs to be int64_t? It should never need to be that large? If it's for integer comparison to not warn, we could just cast?

Contributor Author

@zy1git zy1git Mar 17, 2026

Thanks for the question. nthreads in the backward kernel is grad.numel(), which equals n_rois × channels × pooled_h × pooled_w.

I changed nthreads back to int in both the .cpp and .cu files and ran the test (only the forward kernel test; no backward kernel test due to the large memory requirement). The CPU test passed, but the CUDA test failed with all-zero output. The reason is that the CPU forward kernel doesn't use nthreads: it loops over n_rois separately, and the overflow is handled by the int64_t changes to index_n, index_n_c, and index. The CUDA forward kernel uses a flat loop with nthreads as the bound, so truncating to int caused nthreads to wrap to a negative value, making the loop condition immediately false and skipping all output computation, resulting in all-zero output.

The CPU backward kernel does use nthreads in the same flat-loop pattern as CUDA (for (int64_t index = 0; index < nthreads; ...)) and receives the same overflowing value via grad.numel(), so it needs int64_t for the same reason.

A backward-specific test would require large memory (output + grad_output + grad_input), which might be impractical for CI. Do we want to add one with a memory skip guard, or is the current forward-only test sufficient?
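The failure mode can be sketched without allocating anything (a pure-Python illustration using the reproducer's figures; `as_int32` stands in for C truncation at the parameter boundary):

```python
def as_int32(x):
    """Interpret a Python int as a wrapped 32-bit signed int (C truncation)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

nthreads = 172_000 * 256 * 7 * 7   # grad.numel() in the reproducer
n32 = as_int32(nthreads)           # what an `int nthreads` parameter would receive

# `for (int64_t index = 0; index < nthreads; index++)` never runs when the
# bound arrived truncated: the condition 0 < n32 is already false.
print(n32)       # negative
print(0 < n32)   # False -> loop body skipped -> all-zero gradients
```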

Contributor Author

@zy1git zy1git Mar 17, 2026

As for "If it's for integer comparison to not warn, we could just cast?", I don't think it's a warning issue. The problem is that the value can actually be that large and gets truncated at the call site before the function body runs. In the issue reporter's example (batch_size=172, default 1000 proposals per image), nthreads is 172,000 × 256 × 7 × 7 = 2,157,568,000 > INT_MAX.

I added the backward kernel test in the latest commit. If I change int64_t nthreads to int nthreads, the CPU backward test fails with all-zero gradients because nthreads gets truncated to a negative value and the loop never executes.

In this code, nthreads doesn't mean "number of threads"; it means the total number of output elements to process, which can be very large.

Feel free to let me know if you have any questions.

Member

Thanks. For my own ref:

nthreads is the name for output_size in the cuda kernel:

roi_align_forward_kernel_impl<scalar_t><<<grid, block, 0, stream>>>(
    output_size,

where output_size was already inferred as int64, since size() returns int64_t:

auto num_rois = rois.size(0);
auto channels = input.size(1);
auto height = input.size(2);
auto width = input.size(3);
at::Tensor output = at::zeros(
    {num_rois, channels, pooled_height, pooled_width}, input.options());
auto output_size = num_rois * pooled_height * pooled_width * channels;

nthreads was previously implicitly cast to int32 when passed to roi_align_forward_kernel_impl causing an overflow. Now, with the change made to the function signature, it's properly kept as int64.
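The narrowing at the call boundary can be reproduced in isolation with `ctypes` (a sketch: `ctypes.c_int32` plays the role of the old `int` parameter, and Python's unbounded int plays the role of the int64 `output_size`):

```python
import ctypes

# Mirrors `auto output_size = num_rois * pooled_height * pooled_width * channels;`
output_size = 172_000 * 7 * 7 * 256

# Passing it to a function declared with `int nthreads` truncates at the call site:
as_received = ctypes.c_int32(output_size).value

print(output_size)   # 2157568000
print(as_received)   # negative: the value the kernel saw before the fix
```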

@zy1git zy1git marked this pull request as ready for review March 20, 2026 08:27
test/test_ops.py Outdated

@pytest.mark.parametrize("device", cpu_and_cuda())
def test_roi_align_large_index(self, device):
"""Regression test for https://github.com/pytorch/vision/issues/8206"""
Member

Suggested change
"""Regression test for https://github.com/pytorch/vision/issues/8206"""
"""Non-regression test for https://github.com/pytorch/vision/issues/8206"""

Contributor Author

Fixed in the new commit.

test/test_ops.py Outdated

# Forward kernel test
assert result.shape == (n_rois, channels, pooled_h, pooled_w)
assert result.abs().sum() > 0, "roi_align returned all zeros — likely an index overflow bug"
Member

Here and below, don't specify anything beyond the assert, pytest is already good at showing the right thing. Here, the message doesn't help that much either.

Suggested change
assert result.abs().sum() > 0, "roi_align returned all zeros — likely an index overflow bug"
assert result.abs().sum() > 0

Contributor Author

Fixed in the new commit.

pc.pos1 = static_cast<int64_t>(y_low) * width + x_low;
pc.pos2 = static_cast<int64_t>(y_low) * width + x_high;
pc.pos3 = static_cast<int64_t>(y_high) * width + x_low;
pc.pos4 = static_cast<int64_t>(y_high) * width + x_high;
Member

Can you double check that these casts are needed?

claude says: y_low and y_high are pixel coordinates bounded by height, and width is the image width. y_low * width + x_low is at most height * width, which is the number of pixels in a single channel of a single image. That's not going to overflow int.

Please check every single other change in this PR.

Contributor Author

Good point. These casts are not needed; I removed them in the new commit. I also checked every other change in this PR and fixed them accordingly.
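A quick bounds check makes the distinction concrete (pure-Python sketch; the 172-image batch and 256 channels are from the reproducer, while the 1333×800 input resolution is a hypothetical typical detection size, not a value from this PR):

```python
INT_MAX = 2**31 - 1

batch, channels = 172, 256
height, width = 800, 1333          # hypothetical input feature-map size

# Bilinear-interpolation positions (pos1..pos4): bounded by one channel of one image
per_channel_max = (height - 1) * width + (width - 1)
print(per_channel_max <= INT_MAX)  # True -> no cast needed there

# Pointer offset into the input: scales with batch index and channel count
offset_max = ((batch - 1) * channels + (channels - 1)) * height * width
print(offset_max > INT_MAX)        # True -> 64-bit arithmetic required here
```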


Comment on lines +29 to +30
int64_t index_n =
static_cast<int64_t>(n) * channels * pooled_width * pooled_height;
Contributor Author

@zy1git zy1git Mar 24, 2026

type and cast are good.

const T* offset_input =
input + (roi_batch_ind * channels + c) * height * width;
int64_t index_n_c =
index_n + static_cast<int64_t>(c) * pooled_width * pooled_height;
Contributor Author

The cast could be removed; I removed it in the new commit.

Comment on lines +192 to +195
int64_t n_stride,
int64_t c_stride,
int64_t h_stride,
int64_t w_stride) {
Contributor Author

@zy1git zy1git Mar 24, 2026

int64_t is used to be consistent with grad.stride() which returns int64_t, avoiding implicit narrowing conversions. However, the original code used int and the change is not necessary for fixing the bug, so I will use int instead of int64_t.

int64_t c_stride,
int64_t h_stride,
int64_t w_stride) {
for (int64_t index = 0; index < nthreads; index++) {
Contributor Author

int64_t is needed.

T* offset_grad_input =
grad_input + ((roi_batch_ind * channels + c) * height * width);
T* offset_grad_input = grad_input +
((static_cast<int64_t>(roi_batch_ind) * channels + c) * height * width);
Contributor Author

casting is needed.

((static_cast<int64_t>(roi_batch_ind) * channels + c) * height * width);

int output_offset = n * n_stride + c * c_stride;
int64_t output_offset = static_cast<int64_t>(n) * n_stride + c * c_stride;
Contributor Author

@zy1git zy1git Mar 24, 2026

type and cast are good.
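The subtlety here is that C evaluates `n * n_stride` entirely in `int` first and only then widens the already-wrapped product for the `int64_t` store; the cast forces 64-bit arithmetic from the first multiplication. A minimal sketch (figures from the reproducer; `as_int32` models C `int` arithmetic):

```python
def as_int32(x):
    """Wrap a Python int to 32-bit signed, modelling C `int` arithmetic."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

n = 171_999              # last ROI index in the reproducer
n_stride = 256 * 7 * 7   # contiguous n-stride: channels * pooled_h * pooled_w

without_cast = as_int32(n * n_stride)   # product wraps before the int64_t store
with_cast = n * n_stride                # static_cast<int64_t>(n) * n_stride

print(without_cast)   # negative
print(with_cast)      # 2157555456
```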

template <typename T>
__global__ void roi_align_forward_kernel_impl(
int nthreads,
int64_t nthreads,
Contributor Author

type is good

const T* rois,
T* output) {
CUDA_1D_KERNEL_LOOP(index, nthreads) {
CUDA_1D_KERNEL_LOOP_T(index, nthreads, int64_t) {
Contributor Author

type is good

const T* offset_input =
input + (roi_batch_ind * channels + c) * height * width;
const T* offset_input = input +
(static_cast<int64_t>(roi_batch_ind) * channels + c) * height * width;
Contributor Author

cast is needed here.

template <typename T>
__global__ void roi_align_backward_kernel_impl(
int nthreads,
int64_t nthreads,
Contributor Author

type is good.

Comment on lines +218 to +221
int64_t n_stride,
int64_t c_stride,
int64_t h_stride,
int64_t w_stride,
Contributor Author

types are good.

Comment on lines +253 to +254
const int64_t output_offset =
static_cast<int64_t>(n) * n_stride + c * c_stride;
Contributor Author

Type is good. The cast is technically redundant since n_stride is already int64_t and the multiplication would promote automatically. I kept it to make the 64-bit intent explicit.

int64_t h_stride,
int64_t w_stride,
const int64_t memory_span) {
CUDA_1D_KERNEL_LOOP_T(index, nthreads, int64_t) {
Contributor Author

type is good.

Comment on lines +269 to +270
const int64_t input_offset =
(static_cast<int64_t>(roi_batch_ind) * channels + c) * height * width;
Contributor Author

type and cast are good.

Comment on lines +82 to +84
int64_t index_n_c = index_n + c * pooled_width * pooled_height;
const T* offset_input = input +
(static_cast<int64_t>(roi_batch_ind) * channels + c) * height * width;
Contributor Author

type and cast are good

for (int ph = 0; ph < pooled_height; ph++) {
for (int pw = 0; pw < pooled_width; pw++) {
int index = index_n_c + ph * pooled_width + pw;
int64_t index = index_n_c + ph * pooled_width + pw;
Contributor Author

type is good


Development

Successfully merging this pull request may close these issues.

Bus error or segfault from roi_align with large batchsize

2 participants